DNN-Based Multi-Frame MVDR Filtering for Single-Microphone Speech Enhancement
Multi-frame approaches for single-microphone speech enhancement, e.g., the
multi-frame minimum-variance-distortionless-response (MVDR) filter, are able to
exploit speech correlations across neighboring time frames. It has been shown
that, in contrast to single-frame approaches such as the Wiener gain,
multi-frame approaches achieve substantial noise reduction with hardly any
speech distortion, provided that an accurate estimate of the correlation
matrices, and especially of the speech interframe correlation (IFC) vector, is
available. Typical estimation procedures of the correlation matrices and the
IFC vector require an estimate of the speech presence probability (SPP) in each
time-frequency bin. In this paper, we propose to use
a bi-directional long short-term memory deep neural network (DNN) to estimate a
speech mask and a noise mask for each time-frequency bin, from which two
different SPP estimates are derived. To achieve robust performance,
the DNN is trained for various noise types and signal-to-noise ratios.
Experimental results show that the multi-frame MVDR filter in combination with
the proposed data-driven SPP estimator yields higher speech quality than a
state-of-the-art model-based estimator.
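To make the filter structure concrete, here is a minimal sketch of the multi-frame MVDR filter for a single time-frequency bin, assuming estimates of the noise correlation matrix and the speech IFC vector are already available; the function name and interface are illustrative, not taken from the paper.

```python
import numpy as np

def multiframe_mvdr_weights(R_n: np.ndarray, gamma: np.ndarray) -> np.ndarray:
    """Multi-frame MVDR filter for one time-frequency bin.

    R_n   : (N, N) estimated noise correlation matrix across N
            consecutive frames of one frequency bin.
    gamma : (N,) estimated speech interframe correlation (IFC) vector.
    Returns the (N,) filter that passes the speech component
    undistorted while minimizing the residual noise power.
    """
    R_n_inv_gamma = np.linalg.solve(R_n, gamma)
    return R_n_inv_gamma / (gamma.conj() @ R_n_inv_gamma)

# Applying the filter to a stack y of the current and previous noisy
# STFT coefficients of one frequency bin yields the enhanced bin:
# s_hat = w.conj() @ y
```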
Comparison of Binaural RTF-Vector-Based Direction of Arrival Estimation Methods Exploiting an External Microphone
In this paper we consider a binaural hearing aid setup, where in addition to
the head-mounted microphones an external microphone is available. For this
setup, we investigate the performance of several relative transfer function
(RTF) vector estimation methods to estimate the direction of arrival (DOA) of
the target speaker in a noisy and reverberant acoustic environment. More
specifically, we consider the state-of-the-art covariance whitening (CW) and
covariance subtraction (CS) methods, either incorporating the external
microphone or not, and the recently proposed spatial coherence (SC) method,
requiring the external microphone. To estimate the DOA from the estimated RTF
vector, we propose to minimize the frequency-averaged Hermitian angle between
the estimated head-mounted RTF vector and a database of prototype head-mounted
RTF vectors. Experimental results with stationary and moving speech sources in
a reverberant environment with diffuse-like noise show that the SC method
outperforms the CS method and yields a similar DOA estimation accuracy as the
CW method at a lower computational complexity.
Comment: Submitted to EUSIPCO 202
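As an illustration of the proposed DOA selection step, the following sketch computes the frequency-averaged Hermitian angle between an estimated RTF vector and each prototype in a database and picks the minimizing direction; array shapes and names are assumptions, not taken from the paper.

```python
import numpy as np

def estimate_doa(rtf_est: np.ndarray, rtf_db: np.ndarray,
                 doas: np.ndarray) -> float:
    """Select the DOA whose prototype RTF vector has the smallest
    frequency-averaged Hermitian angle to the estimated RTF vector.

    rtf_est : (F, M) estimated head-mounted RTF vector per frequency bin.
    rtf_db  : (D, F, M) prototype head-mounted RTF vectors for D candidate DOAs.
    doas    : (D,) candidate directions of arrival in degrees.
    """
    # Hermitian angle per candidate and frequency:
    # arccos(|a^H b| / (||a|| ||b||))
    inner = np.abs(np.einsum('fm,dfm->df', rtf_est.conj(), rtf_db))
    norms = np.linalg.norm(rtf_est, axis=-1) * np.linalg.norm(rtf_db, axis=-1)
    angles = np.arccos(np.clip(inner / norms, 0.0, 1.0))
    # Average over frequency, then minimize over candidate directions.
    return doas[np.argmin(angles.mean(axis=1))]
```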
Covariance Blocking and Whitening Method for Successive Relative Transfer Function Vector Estimation in Multi-Speaker Scenarios
This paper addresses the challenge of estimating the relative transfer
function (RTF) vectors of multiple speakers in a noisy and reverberant
environment. More specifically, we consider a scenario where two speakers
become active successively. In this scenario, the RTF vector of the first
speaker can be estimated in a straightforward way, and the main challenge lies
in
estimating the RTF vector of the second speaker during segments where both
speakers are simultaneously active. To estimate the RTF vector of the second
speaker, the so-called blind oblique projection (BOP) method determines the
oblique projection operator that optimally blocks the second speaker. Instead
of blocking the second speaker, in this paper we propose a covariance blocking
and whitening (CBW) method, which first blocks the first speaker, applies
whitening using the estimated noise covariance matrix, and then estimates the
RTF vector of the second speaker based on a singular value decomposition. When
using the estimated RTF vectors of both speakers in a linearly constrained
minimum variance beamformer, simulation results using real-world recordings for
multiple speaker positions demonstrate that the proposed CBW method outperforms
the conventional BOP and covariance whitening methods in terms of
signal-to-interferer-and-noise ratio improvement.
Comment: IEEE Workshop on Applications of Signal Processing to Audio and
Acoustics (WASPAA), New Paltz, NY, USA, Oct 22-25, 202
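The following is one plausible instantiation of the blocking, whitening, and SVD steps described above, written as a sketch; the paper's exact ordering and construction may differ, and all names and interfaces are illustrative.

```python
import numpy as np
from scipy.linalg import cholesky, null_space, solve_triangular

def cbw_rtf_estimate(R_y: np.ndarray, R_n: np.ndarray, h1: np.ndarray,
                     ref: int = 0) -> np.ndarray:
    """Illustrative blocking + whitening + SVD estimate of the RTF vector
    of the second speaker (a sketch, not the paper's exact algorithm).

    R_y : (M, M) noisy covariance estimated while both speakers are active.
    R_n : (M, M) estimated noise covariance matrix.
    h1  : (M,) previously estimated RTF vector of the first speaker.
    """
    L = cholesky(R_n, lower=True)                       # R_n = L L^H
    # Whiten the noisy covariance: L^{-1} R_y L^{-H}; subtracting the
    # identity removes the (whitened) noise contribution.
    tmp = solve_triangular(L, R_y, lower=True)
    Rw = solve_triangular(L, tmp.conj().T, lower=True).conj().T
    Rw -= np.eye(len(h1))
    g1 = solve_triangular(L, h1, lower=True)            # speaker 1, whitened domain
    B = null_space(g1.conj()[np.newaxis, :])            # blocking matrix, B^H g1 = 0
    # After blocking, the residual is (approximately) rank one in speaker 2.
    _, _, Vh = np.linalg.svd(B.conj().T @ Rw)
    g2 = Vh[0].conj()                                   # dominant right singular vector
    h2 = L @ g2                                         # de-whiten
    return h2 / h2[ref]                                 # normalize to reference mic
```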
Relative Transfer Function Vector Estimation for Acoustic Sensor Networks Exploiting Covariance Matrix Structure
In many multi-microphone algorithms for noise reduction, an estimate of the
relative transfer function (RTF) vector of the target speaker is required. The
state-of-the-art covariance whitening (CW) method estimates the RTF vector as
the principal eigenvector of the whitened noisy covariance matrix, where
whitening is performed using an estimate of the noise covariance matrix. In
this paper, we consider an acoustic sensor network consisting of multiple
microphone nodes. Assuming uncorrelated noise between the nodes but not within
the nodes, we propose two RTF vector estimation methods that leverage the
block-diagonal structure of the noise covariance matrix. The first method
modifies the CW method by considering only the diagonal blocks of the estimated
noise covariance matrix. In contrast, the second method considers only the
off-diagonal blocks of the noisy covariance matrix; the resulting estimation
problem, however, cannot be solved using a simple eigenvalue decomposition.
When applying the estimated RTF vector in a
minimum variance distortionless response beamformer, simulation results for
real-world recordings in a reverberant environment with multiple noise sources
show that the modified CW method performs slightly better than the CW method in
terms of signal-to-noise ratio (SNR) improvement, while the off-diagonal
selection method outperforms a
biased RTF vector estimate obtained as the principal eigenvector of the noisy
covariance matrix.
Comment: Proc. IEEE Workshop on Applications of Signal Processing to Audio and
Acoustics (WASPAA), New Paltz, NY, USA, Oct. 202
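To illustrate the first method, the sketch below keeps only the per-node diagonal blocks of the noise covariance matrix and then runs standard covariance whitening; the interface and node layout are assumptions.

```python
import numpy as np
from scipy.linalg import block_diag, cholesky, solve_triangular

def modified_cw(R_y, R_n, node_sizes, ref=0):
    """Sketch of the modified CW method using a block-diagonal noise
    covariance (names are illustrative, not from the paper).

    R_y        : (M, M) noisy covariance matrix.
    R_n        : (M, M) estimated noise covariance matrix.
    node_sizes : microphones per node, e.g. [2, 2, 3], summing to M.
    """
    # Keep only the diagonal blocks of R_n: noise is assumed
    # uncorrelated between nodes, but not within a node.
    blocks, i = [], 0
    for m in node_sizes:
        blocks.append(R_n[i:i + m, i:i + m])
        i += m
    R_n_bd = block_diag(*blocks)
    # Standard covariance whitening with the block-diagonal estimate.
    L = cholesky(R_n_bd, lower=True)                  # R_n_bd = L L^H
    tmp = solve_triangular(L, R_y, lower=True)
    R_w = solve_triangular(L, tmp.conj().T, lower=True).conj().T
    eigvals, eigvecs = np.linalg.eigh(R_w)
    h = L @ eigvecs[:, np.argmax(eigvals)]            # de-whiten principal eigenvector
    return h / h[ref]                                 # RTF vector w.r.t. reference mic
```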
Geometry-aware DoA Estimation using a Deep Neural Network with mixed-data input features
Unlike model-based direction of arrival (DoA) estimation algorithms,
supervised learning-based DoA estimation algorithms based on deep neural
networks (DNNs) are usually trained for one specific microphone array geometry,
resulting in poor performance when applied to a different array geometry. In
this paper we illustrate the fundamental difference between supervised
learning-based and model-based algorithms leading to this sensitivity. Aiming
at designing a supervised learning-based DoA estimation algorithm that
generalizes well to different array geometries, we propose a geometry-aware DoA
estimation algorithm. The algorithm uses a fully connected
DNN and takes mixed data as input features, namely the time lags maximizing the
generalized cross-correlation with phase transform and the microphone
coordinates, which are assumed to be known. Experimental results for a
reverberant scenario demonstrate the flexibility of the proposed algorithm
across different array geometries and show that it outperforms model-based
algorithms such as steered response power with phase transform.
Comment: Submitted to ICASSP 202
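The mixed input features can be illustrated with a short sketch that computes, for each microphone pair, the time lag maximizing the GCC-PHAT and concatenates these lags with the known microphone coordinates; the feature layout shown here is an assumption, not the paper's exact configuration.

```python
import itertools
import numpy as np

def gcc_phat_lag(x1, x2, fs):
    """Time lag (seconds) maximizing the GCC-PHAT between two signals."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    # Phase transform: normalize the cross-spectrum magnitude.
    cc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)
    # Reorder so that lag zero is centered.
    cc = np.concatenate((cc[-(n // 2):], cc[: n // 2 + 1]))
    lags = np.arange(-(n // 2), n // 2 + 1)
    return lags[np.argmax(np.abs(cc))] / fs

def doa_input_features(signals, coords, fs):
    """Mixed input feature vector: GCC-PHAT lags for all microphone
    pairs, concatenated with the (known) microphone coordinates."""
    lags = [gcc_phat_lag(signals[i], signals[j], fs)
            for i, j in itertools.combinations(range(len(signals)), 2)]
    return np.concatenate((lags, coords.ravel()))
```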
Modeling of Speech-dependent Own Voice Transfer Characteristics for Hearables with In-ear Microphones
Hearables often contain an in-ear microphone, which may be used to capture
the own voice of its user. However, due to ear canal occlusion, the in-ear
microphone mostly records body-conducted speech, which suffers from
band-limitation effects and from amplification of low-frequency content. These
transfer characteristics are assumed to vary both with speech content and
across individual talkers. It is desirable to have an
accurate model of the own voice transfer characteristics between hearable
microphones. Such a model can be used, e.g., to simulate a large amount of
in-ear recordings to train supervised learning-based algorithms aiming at
compensating own voice transfer characteristics. In this paper we propose a
speech-dependent system identification model based on phoneme recognition.
Using recordings from a prototype hearable, the modeling accuracy is evaluated
in terms of technical measures. We investigate the robustness of the transfer
characteristic models to utterance and talker mismatch. Simulation results show
that using the proposed speech-dependent model is preferable for simulating
in-ear recordings compared to a speech-independent model. The proposed model is
able to generalize better to new utterances than an adaptive filtering-based
model. Additionally, we find that talker-averaged models generalize better to
different talkers than individual models.
Comment: 18 pages, 11 figures; Extended version of arXiv:2309.08294 (more
detailed description of the problem, additional models considered, more
systematic evaluation conducted on a different, larger dataset)
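As a rough illustration of a speech-dependent (phoneme-wise) transfer model, the sketch below estimates one least-squares transfer function per phoneme class between a reference microphone and the in-ear microphone; the estimator and interface are assumptions rather than the paper's exact model.

```python
import numpy as np

def estimate_phoneme_rtfs(X_ref, X_inear, phoneme_ids, n_phonemes, eps=1e-10):
    """Per-phoneme least-squares transfer estimates between two microphones
    (an illustrative sketch, not the paper's exact model).

    X_ref, X_inear : (T, F) STFT coefficients of the reference (outer)
                     and in-ear microphone signals.
    phoneme_ids    : (T,) recognized phoneme class per frame.
    Returns an (n_phonemes, F) array of transfer function estimates.
    """
    H = np.ones((n_phonemes, X_ref.shape[1]), dtype=complex)
    for p in range(n_phonemes):
        frames = phoneme_ids == p
        if np.any(frames):
            # Per-bin least-squares fit over all frames of this phoneme.
            num = np.sum(np.conj(X_ref[frames]) * X_inear[frames], axis=0)
            den = np.sum(np.abs(X_ref[frames]) ** 2, axis=0) + eps
            H[p] = num / den
    return H

# Simulating an in-ear recording from an outer-microphone signal then
# amounts to applying H[phoneme_ids[t]] to each STFT frame t.
```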